K. Model Diagnostics

We've already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting, using residuals, are more informative for assessing model assumptions, because all covariate effects have been removed.

residuals: ε̂_ij = Y_ij − μ̂_i = Y_ij − Ȳ_i·, approx N

semi-studentized residuals: ω_ij = ε̂_ij / √MSE, approx N

(standardized) studentized residuals: ε*_ij = ε̂_ij / √(MSE(1 − 1/r_i)), approx N

studentized deleted residuals: t_ij = ε̂_ij · √[(N − t − 1) / (SSE(1 − 1/r_i) − ε̂²_ij)], approx t_{N−t−1},

which can also be written in terms of Y_ij − Ŷ_i(j), where Ŷ_i(j) is the fitted mean Ȳ_i· from a model fit after deleting Y_ij.

157
Residual | SAS code: keyword in the OUTPUT line of PROC GLM | S-PLUS or R code, fit = lm(y ~ factor(x))
ε̂_ij | R or RESIDUAL | fit$residuals
ω_ij | calculate in a DATA step | fit$residuals/summary(fit)$sigma
ε*_ij | STUDENT | see code file lmwork.s
t_ij | RSTUDENT | see code file lmwork.s

Some properties:
- Σ_i Σ_j ε̂_ij = 0, thus the ε̂_ij are not independent.
- Var[ε̂_ij] = Var[Y_ij − Ȳ_i·] = σ²(r_i − 1)/r_i, not σ², but Var[ε*_ij] = 1 when the model is correct.

158
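The notes compute these residuals in SAS or S-PLUS/R. As a language-neutral illustration, here is a minimal numpy sketch (toy data; all names and numbers are hypothetical) that computes all four residual types for a one-way layout:

```python
import numpy as np

# Toy one-way ANOVA data: t = 3 groups with unequal sizes r_i
# (all numbers hypothetical, for illustration only)
groups = {0: [5.1, 4.9, 5.4, 5.0],
          1: [6.2, 6.0, 6.5],
          2: [4.0, 4.3, 4.1, 4.4, 4.2]}

y = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
g = np.concatenate([np.full(len(v), k, dtype=int) for k, v in groups.items()])
N, t = len(y), len(groups)

# Fitted cell means and raw residuals: eps_ij = Y_ij - Ybar_i.
means = {k: np.mean(v) for k, v in groups.items()}
fitted = np.array([means[k] for k in g])
eps = y - fitted

SSE = np.sum(eps ** 2)
MSE = SSE / (N - t)
r = np.array([len(groups[k]) for k in g])           # r_i for each observation

omega = eps / np.sqrt(MSE)                          # semi-studentized
star = eps / np.sqrt(MSE * (1 - 1 / r))             # studentized
t_del = eps * np.sqrt((N - t - 1) /
                      (SSE * (1 - 1 / r) - eps ** 2))  # studentized deleted
```

In R the last two are available directly as rstandard(fit) and rstudent(fit); the sketch above just makes the formulas explicit.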
What do we need to check? We will use the studentized residuals ε*_ij for most diagnostics. 159
Tools - Plots

plot residuals versus fitted values
- look for outliers
- look for an even scatter of points above and below the horizontal at zero (indicating homoscedasticity)
- if the r_i are small, also plot residuals versus fitted values

stem-and-leaf or histogram of residuals
- look for outliers
- look for approximate symmetry around 0
- look for approximate bell shape

160
normal probability plot (normal quantile plot) of residuals
- look for the residuals to follow the standard normal straight line

spread-location plot of residuals versus fitted values
- look for an even vertical scatter of points
- superimpose the within-group median{|residuals|} and look for any trend across groups

plot residuals versus observation number, or plot in the order in which the data were collected
- look for a random scatter of points
- any trend may indicate lack of independence

161
plot residuals versus any predictor omitted from the model
- look for a random scatter of points around the horizontal at 0, which indicates the predictor is not needed in the model

162
Tools - Statistics

Outliers
- flag |ε*_ij| > 3
- about 68% of the ε*_ij should fall within (−1, 1)
- about 95% of the ε*_ij should fall within (−2, 2)

Normality
- skewness of residuals should be near 0
- kurtosis of residuals should be near 3 (note that SAS PROC UNIVARIATE gives kurtosis minus 3)

163
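As a quick sanity check of these benchmarks (a sketch, not from the notes; assumes numpy/scipy are available and uses simulated standard-normal residuals as a stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.standard_normal(1000)        # stand-in for studentized residuals

sk = stats.skew(resid)                   # should be near 0 for normal data
ku = stats.kurtosis(resid, fisher=False) # should be near 3
# note: scipy's default (fisher=True) reports kurtosis minus 3, like SAS
```

The fisher=False flag matters: scipy's default, like SAS PROC UNIVARIATE, reports excess kurtosis, so a normal sample gives a value near 0 rather than 3.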
Tools - Tests

Outliers
- if max_{ij} |t_ij| > t_{α/(2N), N−t−1}, then the corresponding Y_ij is an outlier. Why use a type I error of α/(2N)?

Normality
- reject H₀: residuals are normally distributed at level α if corr(ε̂_(ij), E[ε̂_(ij)]) < q_α from the table for the correlation test for normality, where the ε̂_(ij) are the ordered residuals. What is E[ε̂_(ij)]?

164
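Both tests can be sketched in a few lines of Python (an illustration with simulated residuals, not the notes' SAS/R code; scipy's probplot supplies approximate expected normal order statistics in place of the exact E[ε̂_(ij)]):

```python
import numpy as np
from scipy import stats

# Hypothetical studentized deleted residuals for N = 30 obs., t = 3 groups
rng = np.random.default_rng(1)
N, t = 30, 3
t_del = rng.standard_normal(N)

alpha = 0.05
crit = stats.t.ppf(1 - alpha / (2 * N), df=N - t - 1)  # Bonferroni cutoff
outliers = np.abs(t_del) > crit

# Correlation test for normality: correlate the ordered residuals with
# (approximate) expected normal order statistics; probplot returns both
(osm, osr), _ = stats.probplot(t_del)
r_normality = np.corrcoef(osm, osr)[0, 1]
```

The α/(2N) cutoff is the Bonferroni adjustment for implicitly testing all N residuals; the correlation would then be compared with the tabled q_α.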
Homoscedasticity

- Hartley test (F_max): if the assumptions of independence and normality hold and r_i = r for all i, then we can test H₀: σ²_1 = σ²_2 = ... = σ²_t versus H_A: not all σ²_i are equal, for σ²_i = Var[Y_ij]. Reject H₀ at level α if

F_max = max_i(s²_i) / min_i(s²_i) > F_max(α; t, r − 1).

F_max has a distribution derived for this test and can be found in tables. If the r_i are close but not all equal, use df = (1/t) Σ_i (r_i − 1) instead of r − 1.

165
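Computing the F_max statistic is straightforward; only the critical value needs a table. A minimal numpy sketch (hypothetical balanced data; scipy has no built-in Hartley distribution):

```python
import numpy as np

# Hypothetical balanced design: t = 4 groups of common size r = 10
rng = np.random.default_rng(2)
t, r = 4, 10
data = [rng.normal(0.0, 1.0, size=r) for _ in range(t)]

s2 = np.array([np.var(grp, ddof=1) for grp in data])
F_max = s2.max() / s2.min()
# compare F_max with the tabulated critical value F_max(alpha; t, r - 1);
# Hartley's distribution is not available in scipy, so a table is needed
```

Since F_max is a ratio of the largest to the smallest sample variance, it is always at least 1, and values far above the tabled cutoff indicate heteroscedasticity.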
- Modified Levene (median) test: if the assumption of independence holds and normality approximately holds, we can test H₀: σ²_1 = σ²_2 = ... = σ²_t versus H_A: not all σ²_i are equal using the group medians Ỹ_i = median_j{Y_ij}:
1. Compute z_ij = |Y_ij − Ỹ_i|.
2. Fit a one-way ANOVA using the z_ij.
3. Reject H₀ at level α if F = MST/MSE > F_{α, t−1, N−t}.

166
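These three steps can be verified directly: scipy's levene with center='median' (the Brown-Forsythe version) is exactly a one-way ANOVA F-test on the z_ij. A sketch with hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical data: third group has a larger standard deviation
rng = np.random.default_rng(3)
data = [rng.normal(0.0, s, size=15) for s in (1.0, 1.0, 3.0)]

# steps 1-2 by hand: one-way ANOVA on z_ij = |Y_ij - median_i|
z = [np.abs(grp - np.median(grp)) for grp in data]
F_manual, p_manual = stats.f_oneway(*z)

# scipy's levene with center='median' is this same Brown-Forsythe version
F_scipy, p_scipy = stats.levene(*data, center='median')
```

The two F statistics agree, confirming that the modified Levene test is just the usual ANOVA machinery applied to absolute deviations from group medians.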
L. Remedial Measures

What is the effect of the failure of the one-way ANOVA model assumptions?

Moderate lack of normality leads to only a slight loss of power. Kurtosis has a greater impact on power than skewness. μ̂_i, Ĉ, and MSE are unbiased with or without normally distributed errors.

If the r_i are not all equal, then a violation of homoscedasticity can affect the power of the F-test. If the r_i are approximately equal, then non-constant variance will only have a mild impact on the F-test.

167
Violation of the independence assumption is potentially the most serious, especially if the ignored correlation is large (ρ > 0.5). An ignored positive correlation will give variance estimates (e.g., V̂ar[μ̂_i], V̂ar[Ĉ]) that are too small, so null hypotheses may be rejected when they should not be.

Outliers usually do not have a big impact, since the F-test is fairly robust to skewness.

Omitting important covariates can have a large impact on the estimated means and their interpretation, and consequently on the F-test as well.

Violation of normality has a larger impact on confidence intervals than on F-tests.

168
Remedial measures are methods we use to try to fix the violated assumptions. Outliers - Fit the model once with the outliers, and once without. Compare the two fitted models (ˆµ i, F-test, contrasts of interest). If they are not substantially different in terms of scientific conclusions, then leave the outliers in. - Always check to make sure outliers are not just the result of a data entry error, equipment malfunction, or miscalculation. If they are, then the outliers should be corrected or omitted. - If the two models give substantially different conclusions, then both sets of results should be reported, or an alternative analysis technique should be used. 169
Omitted covariates
- If an omitted covariate appears to be important from a residual plot, then add it to the model and test it for statistical significance.
- If you know an important covariate was omitted, but it was not collected or you do not have access to it, there will be problems with model interpretation.

Independence
- If you know the source of the correlation, then you can fit a random effects model to adjust for it.
- If you do not, then use a working-independence model with a robust sandwich variance estimator, as in generalized estimating equations.

170
Normality is satisfied but homoscedasticity is not.

Suppose the violation is such that the ε_ij are independent N(0, σ²_i). Since the σ²_i are unknown, they will need to be estimated using

s²_i = (1/(r_i − 1)) Σ_{j=1}^{r_i} (Y_ij − Ȳ_i·)².

Having non-constant variance means that the μ̂_i = Ȳ_i· no longer have minimum variance among all unbiased linear estimators. We must adjust for the groups with larger variances. How do we do that?

171
Instead of minimizing the least squares criterion Σ_i Σ_j (Y_ij − μ_i)², we will minimize the weighted least squares criterion Σ_i Σ_j w_ij (Y_ij − μ_i)², where w_ij = 1/s²_i. We still get μ̂_i = Ȳ_i·, but our sums of squares will now be weighted as well.

least squares:
SST = Σ_i r_i (Ȳ_i· − Ȳ··)²
SSE = Σ_i Σ_j (Y_ij − Ȳ_i·)²

weighted least squares:
SST_w = Σ_i (r_i/s²_i)(Ȳ_i· − Ȳ_w)²
SSE_w = Σ_i (1/s²_i) Σ_j (Y_ij − Ȳ_i·)²

where Ȳ_w is the weighted grand mean.

172
Now F = MST/MSE will only have an approximate F distribution. The larger the r_i, the better the approximation.

Coding:
SAS - WEIGHT statement in PROC GLM
S-PLUS & R - lm(..., weights = ...)

If you saw weighted least squares in regression, this is the same thing. We just need to write the ANOVA model in the regression parameterization and use β̂ = (X′WX)⁻¹X′WY, where W is the diagonal matrix of weights.

173
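The regression-parameterization claim is easy to check numerically. A minimal numpy sketch (hypothetical data; the cell-means design matrix and weights are built by hand rather than with lm or PROC GLM):

```python
import numpy as np

# Hypothetical heteroscedastic groups: (mean, sd, size)
rng = np.random.default_rng(4)
specs = [(5.0, 1.0, 8), (7.0, 3.0, 12), (6.0, 2.0, 10)]
groups = [rng.normal(m, s, size=n) for m, s, n in specs]

y = np.concatenate(groups)
s2 = np.array([np.var(grp, ddof=1) for grp in groups])

# cell-means parameterization: one indicator column per group
X = np.zeros((len(y), len(groups)))
w = np.zeros(len(y))
start = 0
for i, grp in enumerate(groups):
    X[start:start + len(grp), i] = 1.0
    w[start:start + len(grp)] = 1.0 / s2[i]       # w_ij = 1 / s_i^2
    start += len(grp)

W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (X'WX)^{-1} X'W y
# with weights constant within a group, WLS still returns the group means
```

Because the weights are constant within each group, the WLS solution reproduces the group means Ȳ_i·, exactly as stated above; the weighting changes the sums of squares and standard errors, not the point estimates.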
Neither normality nor homoscedasticity is satisfied.

(1) Transform the data, the Y_ij values. Watch out for negative and 0 values, which affect how transformations can be done.

(a) If σ²_i = cμ_i, then try √Y_ij. Plot s²_i versus Ȳ_i· and look for an increasing or decreasing linear trend, or compute s²_i/Ȳ_i· and look for the values to be similar for all i.

(b) If σ_i = cμ_i, then try log(Y_ij + k) for some small k. Plot s_i versus Ȳ_i· or compute s_i/Ȳ_i· as above.

(c) If σ_i = cμ²_i, then try 1/(Y_ij + k) for some small k. Plot s_i versus Ȳ²_i· or compute s_i/Ȳ²_i· as above.

174
(d) If Y_ij is a proportion, then try log(Y_ij) − log(1 − Y_ij), i.e., the log odds. If the proportions come from differently sized samples, then also try weighted least squares.

(e) If none of the above work, then
- try the Box-Cox procedure for finding an appropriate power transformation
- try a non-linear model, e.g., a generalized linear model or non-parametric regression.

What are the disadvantages of fitting models on a transformed outcome?

175
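Both the log-odds transform and the Box-Cox procedure are available in scipy. A brief sketch (hypothetical right-skewed data; scipy.stats.boxcox chooses the power λ by maximum likelihood and requires strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.gamma(shape=2.0, scale=3.0, size=200)  # positive, right-skewed outcome

y_bc, lam = stats.boxcox(y)        # lambda chosen by maximum likelihood

p = 0.3                            # a proportion
logit = np.log(p) - np.log(1 - p)  # log-odds transform
```

After the Box-Cox transform the sample skewness is pulled toward 0, which is the point of the procedure; the trade-off, as the closing question hints, is that μ̂_i and contrasts are then estimated and interpreted on the transformed scale.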